Abstract

Quality assurance is one of the most important challenges
in crowdsourcing. Assigning tasks to several workers
to increase quality through redundant answers can be
expensive if the sources asked are homogeneous. This limitation
has been overlooked by current crowdsourcing platforms,
which therefore results in costly solutions. To achieve
desirable cost-quality tradeoffs, it is essential
to apply efficient crowd access optimization techniques.

Our work argues that optimization needs to be aware of
diversity and correlation of information within groups
of individuals so that crowdsourcing redundancy can
be adequately planned beforehand. Based on this intuitive
idea, we introduce the Access Path Model (APM),
a novel crowd model that leverages the notion of access
paths as an alternative way of retrieving information.

APM aggregates answers ensuring high quality and
meaningful confidence. Moreover, we devise a greedy
optimization algorithm for this model that finds a provably
good approximate plan to access the crowd. We
evaluate our approach on three crowdsourced datasets
that illustrate various aspects of the problem. Our results
show that the Access Path Model combined with greedy
optimization is cost-efficient and practical to overcome
common difficulties in large-scale crowdsourcing like
data sparsity and anonymity.

Introduction 

Crowdsourcing has attracted the interest of many research
communities such as database systems, machine learning,
and human-computer interaction because it allows humans
to collaboratively solve problems that are difficult to handle
with machines only. Two crucial challenges in crowdsourcing,
independent of the field of application, are (i) quality
assurance and (ii) crowd access optimization. Quality assurance
provides strategies that proactively plan and ensure
the quality of algorithms run on top of crowdsourced data.
Crowd access optimization then supports quality assurance
by carefully selecting from a large pool the crowd members
to ask under limited budget or quality constraints. In
current crowdsourcing platforms, redundancy (i.e. assigning
the same task to multiple workers) is the most common and
straightforward way to guarantee quality (Karger, Oh, and
Shah 2011). Simple as it is, redundancy can be expensive if
used without any target-oriented approach, especially if the
errors of workers show dependencies or are correlated. Asking
people whose answers are expected to converge to the
same opinion is neither efficient nor insightful. For example,
in a sentiment analysis task, one would prefer to consider
opinions from different non-related groups of interests
before forming a decision. This is the basis of the diversity
principle introduced by (Surowiecki 2005). The principle
states that the best answers are achieved from discussion and
contradiction rather than agreement and consensus.

In this work, we incorporate the diversity principle in
a novel crowd model, named Access Path Model (APM),
which seamlessly tackles quality assurance and crowd access
optimization and is applicable in a wide range of use
cases. It explores crowd diversity not on the individual
worker level but on the common dependencies of workers
while performing a task. In this context, an access path is a
way of retrieving a piece of information from the crowd. The
configuration of access paths can be based on various criteria
depending on the task: (i) workers' demographics (e.g.
profession, group of interest, age), (ii) the source of information
or the tool that is used to find the answer (e.g. phone call
vs. web page, Bing vs. Google), (iii) task design (e.g. time of
completion, user interface), (iv) task decomposition (e.g. part
of the answers, features).

Example 1. Peter and Aanya natively speak two different
languages which they would like to teach to their young children.
At the same time, they are concerned about how this multilingual
environment affects the learning abilities of their
children. More specifically, they want to answer the question
“Does raising children bilingually cause language delay?”.
To resolve their problem, they can ask three different groups
of people (access paths):


Access Path      Error rate   Cost
Pediatricians    10%          $20
Logopedists      15%          $15
Other parents    25%          $10

Table 1: Access path configuration for Example 1
Figure 1 illustrates the given situation with respect to the
Access Path Model. In this example, each of the groups approaches
the problem from a different perspective and has
different associated error rates and costs. Considering that
Peter and Aanya have a limited budget to spend and can ask
more than one person on the same access path, they are interested
in finding the optimal combination of access paths that
will give them the most insightful information for their budget
constraints. Throughout this paper, a combination of access
paths will be referred to as an access plan, and it defines
how many different people to ask on each available access
path. Our model aims at helping general requesters in crowdsourcing
platforms to find optimal access plans and appropriately
aggregate the collected data. Results from experiments
on real-world crowdsourcing show that a pre-planned
combination of diverse access paths indeed outperforms
pure (i.e. single access path) access plans, random selection,
and equal distribution of budget across access paths.
The main finding is that diversity is a powerful means that
matters for quality assurance.

Figure 1: APM for crowdsourcing a medical question

Contributions 

Previous work on quality assurance and crowd access optimization
focuses on two different approaches: majority-based
strategies and individual models. Majority voting is
oblivious to personal characteristics of crowd workers and is
therefore limited in terms of optimization. Individual models
instead base their decisions on the respective performance of
each worker, targeting those with the best accuracy (Dawid
and Skene 1979; Whitehill et al. 2009). These models are
useful for spam detection and pricing schemes but do not
guarantee answer diversity and might fall into partial consensus
traps.

As outlined in Table 2, the APM is a middle-ground solution
between these two choices and offers several advantages.
First, it is aware of answer diversity, which is particularly
important for requests without an established ground
truth. Second, since it manages group-based answer correlations
and dependencies, it facilitates efficient optimization
of redundancy. Third, the APM is a practical model for current
crowdsourcing marketplaces where, due to competition,
the availability of a particular person is never guaranteed or
authorships may be hidden for privacy reasons. Last, its predictions
are mapped to meaningful confidence levels, which
can simplify the interpretation of results.

In summary, this work makes the following contributions: 

• Modeling the crowd for quality assurance. We design
the Access Path Model as a Bayesian Network that,
through the usage of latent variables, is able to capture
and utilize crowd diversity from a non-individual point of
view. The APM can be applied even if the data is sparse
and crowd workers are anonymous.



                             Majority   Individual   Access Path
                             Voting     Models       Model
Diversity awareness          ✗          ✓            ✓
Cost-efficient optimization  ✗          ✗            ✓
Sparsity, anonymity          ✓          ✗            ✓
Meaningful confidence        ✓          ✗            ✓

Table 2: Comparison of APM with current approaches.


• Crowd access optimization. We use an information-theoretic
objective for crowd access optimization. We
prove that our objective is submodular, allowing us to
adopt efficient greedy algorithms with strong guarantees.

• Real-world experiments. Our extensive experiments
cover three different domains: answering medical questions,
sport events prediction, and bird species classification.
We compare our model and optimization scheme
with state-of-the-art techniques and show that it makes
robust predictions with lower cost.


Problem Statement 

In this work, we identify and address two closely related
problems: (1) modeling and aggregating diverse crowd answers,
which we call the crowdsourced predictions problem,
and (2) optimizing the budget distribution for better quality,
referred to as the access path selection problem.

Problem 1 (Crowdsourced Predictions). Given a
task represented by a random variable Y, and a set of answers
from W workers represented by random variables
$X_1, \ldots, X_W$, the crowdsourced prediction problem is to find
a high-quality prediction of the outcome of task Y by aggregating
these votes.


Quality criteria. A high-quality prediction is not only accurate
but should also be linked to a meaningful confidence
score, which is formally defined as the likelihood of the prediction
to be correct. This property simplifies the interpretation
of predictions coming from a probabilistic model. For
example, if a doctor wants to know whether a particular
medicine can positively affect the improvement of a disease
condition, providing a raw yes/no answer is not sufficiently
informative. Instead, it is much more useful to associate
the answer with a trustworthy confidence score.


Requirements and challenges. To provide high-quality predictions,
it is essential to precisely represent the crowd. The
main aspects to be represented are (i) the conditional dependence
of worker answers within access paths given the
task and (ii) the conditional independence of worker answers
across access paths. As we will show in this paper, modeling
such dependencies is also crucial for efficient optimization.
Another realistic requirement concerns the support for data
sparsity and anonymity. Data sparsity is common in crowdsourcing
(Venanzi et al. 2014) and occurs when the number
of tasks that workers solve is not sufficient to estimate their
errors, which can negatively affect quality. In other cases, the
identity of workers is not available, but it is required to make
good predictions based on non-anonymized features.





































Problem 2 (Access Path Selection). Given a task represented
by a random variable Y that can be solved by the
crowd following N different access paths denoted with the
random variables $Z_1, \ldots, Z_N$, using a maximum budget B,
the access path selection problem is to find the best possible
access plan $S_{best}$ that leads to a high-quality prediction of
the outcome of task Y.

An access plan defines how many different people are
chosen to complete the task from each access path. In Example 1,
we will ask one pediatrician, two logopedists and
three different parents if the access plan is S = [1, 2, 3].
Each access plan is associated with a cost c(S) and quality
q(S). For example, $c(S) = \sum_{i=1}^{3} c_i \cdot S[i] = \$80$, where $c_i$
is the cost of getting one single answer through access path
$Z_i$. In these terms, the access path selection problem can be
generally formulated as:

$$S_{best} = \arg\max_{S \in \mathcal{S}} q(S) \quad \text{s.t.} \quad \sum_{i=1}^{N} c_i \cdot S[i] \le B \tag{1}$$

This knapsack maximization problem is NP-Hard even
for submodular functions (Feige 1998). Hence, designing
bounded and efficient approximation schemes is useful for
realistic crowd access optimization.
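
The cost of a plan and the size of the feasible search space can be illustrated with a short Python sketch. It uses the per-answer costs from Table 1; the function names `plan_cost` and `feasible_plans` are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Per-answer costs from Table 1: pediatricians, logopedists, other parents.
COSTS = [20, 15, 10]


def plan_cost(plan, costs=COSTS):
    """Cost c(S) = sum_i c_i * S[i] of an access plan S."""
    return sum(c * s for c, s in zip(costs, plan))


def feasible_plans(budget, costs=COSTS):
    """Enumerate all plans whose cost does not exceed the budget.

    Brute force: each path i can receive between 0 and budget // c_i votes,
    and infeasible combinations are filtered out afterwards.
    """
    ranges = [range(budget // c + 1) for c in costs]
    return [p for p in product(*ranges) if plan_cost(p, costs) <= budget]


if __name__ == "__main__":
    print(plan_cost([1, 2, 3]))          # 80, as in the example above
    print(len(feasible_plans(80)))       # number of candidate plans for B = 80
```

Even for three access paths the number of candidate plans grows quickly with the budget, which is why exhaustive enumeration is only practical for very small instances.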


Access Path Model 

The crowd model presented in this section aims at fulfilling
the requirements specified in the definition of Problem 1
(Crowdsourced Predictions) and enables our method
to learn the error rates from historical data and then aggregate
worker votes accordingly.


Access Path Design 

Due to the variety of problems possible to crowdsource, an
important step concerns the design of access paths. The access
path notion is a broad concept that can accommodate
various situations and may take different shapes depending
on the task. Below we describe a list of viable configurations
that can be easily applied in current platforms.

• Demographic groups. Common demographic characteristics
(location, gender, age) can establish strong statistical
dependencies of workers' answers (Kazai, Kamps, and
Milic-Frayling 2012). Such groups are particularly diverse
for problems like sentiment analysis or product evaluation
and can be retrieved from crowdsourcing platforms as part
of the task, worker information, or qualification tests.

• Information sources. For data collection and integration
tasks, the data source being used to deduplicate or match
records (addresses, business names etc.) is the primary
cause of error or accuracy (Pochampally et al. 2014).

• Task design. In other cases, the answer of a worker may
be psychologically affected by the user interface design.
For instance, in crowdsourced sorting, a worker may rate
the same product differently depending on the scaling system
(stars, 1-10 etc.) or other products that are part of the
same batch (Parameswaran et al. 2014).

• Task decomposition. Often, complicated problems are
decomposed into smaller ones. Each subtask type can
serve as an access path. For instance, in the bird classification
task that we study later in our experiments, workers
can resolve separate features of the bird (i.e. color, beak
shape etc.) rather than its category.


In these scenarios, the access path definition natively comes
with the problem or the task design. However, there are scenarios
where the structure is not as evident or more than one
grouping is applicable. Helpful tools in this regard include
graphical model structure learning based on conditional
independence tests (De Campos 2006) and information-theoretic
group selection (Li, Zhao, and Fuxman 2014).


Architectural implications. We envision access path design
as part of the quality assurance and control module for
new crowdsourcing frameworks or, in our case, as part of
the query engine in a crowdsourced database (Franklin et
al. 2011). In the latter context, the notion of access paths is
one of the main pillars in query optimization for traditional
databases (Selinger et al. 1979), where access path selection
(e.g. sequential scan or index) has significant impact on the
query response time. In addition, in a crowdsourced database
the access path selection also affects the quality of query results.
In such an architecture, the query optimizer is responsible
for (i) determining the optimal combination of access
paths as shown in the following section, and (ii) forwarding
the design to the UI creation. The query executor then collects
the data from the crowd and aggregates it through
probabilistic inference over the APM.


Alternative models 

Before describing the structure of the Access Path Model,
we first look at alternative models and their behavior
with respect to quality assurance. Table 3 specifies
the meaning of each symbol as used throughout this paper.

Majority Vote (MV). Being the simplest of the models and
also the most popular one, majority voting is able to produce
fairly good results if the crowdsourcing redundancy is sufficient.
Nevertheless, majority voting considers all votes as
equal with respect to quality and cannot be integrated with
any optimization scheme other than random selection.
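
As a point of reference, majority voting can be sketched in a few lines of Python. The function name `majority_vote` is hypothetical; the returned vote share is only a raw agreement ratio and should not be read as a calibrated confidence score.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent answer and its share among all votes.

    Every vote counts equally, regardless of who cast it. Ties are broken
    in favor of the answer that was seen first.
    """
    counts = Counter(votes)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(votes)


if __name__ == "__main__":
    print(majority_vote(["yes", "no", "yes"]))
```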

Naive Bayes Individual (NBI). This model assigns individual
error rates to each worker and uses them to weigh
the incoming votes and form a decision (Figure 2). In cases
when the ground truth is unknown, the error estimation
is carried out through an EM algorithm as proposed by
(Dawid and Skene 1979). Aggregation (i.e. selecting the best
prediction) is then performed through Bayesian inference.
For example, for a set of votes $x_t$ coming from W different
workers $X_1, \ldots, X_W$, the most likely outcome among
all candidate outcomes $y_c$ is computed as prediction =
$\arg\max_{y_c \in Y} p(y_c \mid x_t)$, whereas the joint probability of a
candidate answer $y_c$ and the votes $x_t$ is:

$$p(y_c, x_t) = p(y_c) \prod_{w=1}^{W} p(x_{wt} \mid y_c) \tag{2}$$

The quality of predictions for this model highly depends on
the assumption that each worker has solved a sufficient
number of tasks. This assumption generally does not
























Symbol        Description
Y             random variable of the crowdsourced task
X_w           random variable of worker w
W             number of workers
Z_i           latent random variable of access path i
X_ij          random variable of worker j in access path i
N             number of access paths
B             budget constraint
S             access plan
S[i]          no. of votes from access path i in plan S
c_i           cost of access path i
D             training dataset
s<y, x>       instance of task sample in a dataset
θ             parameters of the Access Path Model

Table 3: Symbol description


hold for open crowdsourcing markets where stable participation
of workers is not guaranteed. As we show in the experimental
evaluation, this is harmful not only for estimating
the error rates but also for crowd access optimization,
because access plans might not be implementable or might have a
high response time. Furthermore, even in cases of fully committed
workers, NBI does not provide the proper logistics
to optimize the budget distribution since it does not capture
the shared dependencies between the workers. Last, due to
the Naive Bayes inference, which assumes conditional independence
between each pair of workers, predictions of this
model are generally overconfident.
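
The aggregation in Equation 2 can be sketched as follows for the simplest setting. This is a minimal illustration that assumes a binary task and one symmetric error rate per worker (both simplifications introduced here, not part of the model definition); `nbi_posterior` is a hypothetical name.

```python
import math


def nbi_posterior(votes, error_rates, prior=0.5):
    """Posterior p(Y = 1 | votes) under Naive Bayes with per-worker errors.

    votes:        list of 0/1 answers, one per worker
    error_rates:  probability (strictly between 0 and 1) that each worker
                  answers incorrectly
    """
    log_joint = {}
    for y, p_y in ((1, prior), (0, 1.0 - prior)):
        lj = math.log(p_y)
        for x, e in zip(votes, error_rates):
            # p(x | y) is 1 - e when the vote matches y, and e otherwise.
            lj += math.log(1.0 - e if x == y else e)
        log_joint[y] = lj
    # Normalize in log space for numerical stability.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z


if __name__ == "__main__":
    # Three agreeing workers with a 25% error rate each.
    print(nbi_posterior([1, 1, 1], [0.25, 0.25, 0.25]))
```

Because every agreeing vote multiplies the odds by the same factor, the posterior approaches 1 quickly as votes accumulate, which is the overconfidence effect described above.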

Access Path based models 

Access Path based models group the answers of the crowd 
according to the access path they originate from. We first 
describe a simple Naive Bayes version of such a model and 
then elaborate on the final design of the APM. 

Naive Bayes for Access Paths (NBAP). To correct
the effects of non-stable participation of individual workers,
we first consider another alternative, similar to our original
model, presented in Figure 3. The votes of the workers here
are grouped according to the access path. For inference purposes,
each vote $X_{ij}$ is then weighed with the average error
rate $\theta_i$ of the access path it comes from. In other words, it is
assumed that all workers within the same access path share
the same error rate. As a result, all votes belonging to the
same access path behave as a single random variable, which
enables the model to support highly sparse data. Yet, due
to the similarity with NBI and all Naive Bayes classifiers,
NBAP cannot make predictions with meaningful confidence,
especially when there exists a large number of access paths.

Access Path Model overview. Based on the analysis of previous
models, we propose the Access Path Model as presented
in Figure 4, which shows an instantiation for three access
paths. We design the triple <task, access path, worker>
as a hierarchical Bayesian Network in three layers.

Layer 1. Variable Y in the root of the model represents the
random variable modeling the real outcome of the task.

Layer 2. This layer contains the random variables modeling
the access paths $Z_1, Z_2, Z_3$. Each access path is represented
as a latent variable, since its values are not observable. Due
to the tree structure, every pair of access paths is conditionally
independent given Y while the workers that belong to


Figure 2: Naive Bayes Individual - NBI. 



Figure 3: Naive Bayes Model for Access Paths - NBAP. 



Figure 4: Bayesian Network Model for Access Paths - APM. 


the same access path are not. This conditional independence
is the key to representing diversity, since it implements
separate probabilistic channels. Their purpose is to distinguish
the information that can be obtained from the workers
from the information that comes from the access path.

The enhanced expressiveness of this auxiliary layer over
the previously described NBAP model avoids overconfident
predictions as follows. Whenever a new prediction is made,
the amount of confidence that identical answers from different
workers in the same access path can bring is first
blocked by the access path (i.e. the latent variable). If
the number of agreeing workers within the same access path
increases, confidence increases as well, but not at the same
rate as it does with NBI. Additional workers contribute
only with their own signal, while the access path signal has
already been taken into consideration. In terms of optimization,
this property of the APM is a good motivation for
combining various access paths within the same plan.

Layer 3. The lowest layer contains the random variables X
modeling the votes of the workers grouped by the access
path they are following. For example, $X_{ij}$ is the j-th worker
on the i-th access path. The incoming edges represent the
error rates of workers conditioned by their access paths.

Parameter learning. The purpose of the training stage is to
learn the parameters of the model, i.e. the conditional probability
of each variable with respect to its parents that are
graphically represented by the network edges in Figure 4.
We will refer to the set of all model parameters as $\theta$. More
specifically, $\theta_{Z_i \mid Y}$ represents the table of conditional error
probabilities for the i-th access path given the task Y, and
$\theta_{X_{ij} \mid Z_i}$ represents the table of conditional error probabilities
for the j-th worker given the i-th access path.

For a dataset D with historical data of the same type of
task, the parameter learning stage finds the maximum likelihood
estimate $\theta_{MLE} = \arg\max_{\theta} p(D \mid \theta)$. According to our
model, the joint probability of a sample $s_k$ factorizes as:

$$p(s_k \mid \theta) = p(y_k \mid \theta) \prod_{i=1}^{N} \Big( p(z_{ik} \mid y_k, \theta) \prod_{j=1}^{S_k[i]} p(x_{ijk} \mid z_{ik}, \theta) \Big) \tag{3}$$

where $S_k[i]$ is the number of votes in access path $Z_i$ for
the sample. As the access path variables $Z_i$ are not observable,
we apply an Expectation Maximization (EM) algorithm
(Dempster, Laird, and Rubin 1977) to find the best
parameters. Notice that applying EM for the network model
in Figure 4 will learn the parameters for each worker in the
crowd. This scheme works if the set of workers involved in
the task is sufficiently stable to provide enough samples for
computing their error rates (i.e. $\theta_{X_{ij} \mid Z_i}$) and if the worker id
is not hidden. Since this is not always the case in many
crowdsourcing applications (including our experiments),
we share the parameters of all workers within an access path.
This enables us to later apply to the model an optimization
scheme that is agnostic about the identity of workers. The
generalization is optional for the APM and obligatory for NBAP.

Training cost analysis. The amount of data needed to train
the APM is significantly lower than what individual models
require, which results in a faster learning process. The reason
is that the APM can benefit even from infrequent participation
of individuals $X_{ij}$ to estimate accurate error rates for
access paths $Z_i$. Moreover, sharing the parameters of workers
in the same access path reduces the number of parameters
to learn from W for individual models to 2N for the
APM, which is typically several orders of magnitude lower.

Inference. After parameter learning, the model is used to infer
the outcome of a task using the available votes on each
access path. As in previous models, the inference step computes
the likelihood of each candidate outcome $y_c \in Y$
given the votes in the test sample $x_t$ and chooses the most
likely candidate as prediction = $\arg\max_{y_c \in Y} p(y_c \mid x_t)$. As
the test samples contain only the values for the variables X,
the joint probability between the candidate outcome and the
test sample is computed by marginalizing over all possible
values of $Z_i$ (Eq. 4). For a fixed cardinality of $Z_i$, the complexity
of inferring the most likely prediction is O(NM).

$$p(y_c, x_t) = p(y_c) \prod_{i=1}^{N} \Big( \sum_{z \in \{0,1\}} p(z \mid y_c) \prod_{j=1}^{S_t[i]} p(x_{ijt} \mid z) \Big) \tag{4}$$

The confidence of the prediction maps to the likelihood that
the prediction is accurate, $p(\text{prediction} \mid x_t)$. Marginalization
in Equation 4 is the technical step that avoids overconfidence
by smoothly blocking the confidence increase when similar
answers from the same access path are observed.
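
The marginalization in Equation 4 can be made concrete with a small Python sketch. It assumes a binary task, binary latent access path variables, shared worker parameters within a path, and symmetric error tables; these simplifications, as well as the names `sym` and `apm_posterior`, are introduced here purely for illustration.

```python
def sym(e):
    """Symmetric 2x2 conditional table: table[a][b] = p(b | a) with error e."""
    return [[1.0 - e, e], [e, 1.0 - e]]


def apm_posterior(votes_by_path, p_z_given_y, p_x_given_z, prior=0.5):
    """Posterior p(Y = 1 | votes), marginalizing each latent Z_i as in Eq. 4.

    votes_by_path[i]:   list of 0/1 votes collected on access path i
    p_z_given_y[i][y][z] and p_x_given_z[i][z][x]: conditional tables
    """
    joint = {}
    for y, p_y in ((0, 1.0 - prior), (1, prior)):
        total = p_y
        for i, votes in enumerate(votes_by_path):
            path_sum = 0.0
            for z in (0, 1):  # sum over the latent access path variable
                term = p_z_given_y[i][y][z]
                for x in votes:
                    term *= p_x_given_z[i][z][x]
                path_sum += term
            total *= path_sum
        joint[y] = total
    return joint[1] / (joint[0] + joint[1])


if __name__ == "__main__":
    path_tables = [sym(0.1)]    # assumed access path error of 10%
    worker_tables = [sym(0.2)]  # assumed worker error of 20% given the path
    for n in (1, 3, 20):
        print(n, apm_posterior([[1] * n], path_tables, worker_tables))
```

With a single access path whose error is 10%, the posterior for any number of agreeing votes stays below 0.9 in this setup: additional workers sharpen the estimate of the latent path variable but cannot remove the uncertainty of the path itself.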

Crowd Access Optimization 

Crowd access optimization is crucial for both paid and non-paid
forms of crowdsourcing. While in paid platforms the goal is
to acquire the best quality for the given monetary budget, in
non-paid applications the necessity for optimization comes
from the fact that highly redundant accesses might decrease
user satisfaction and increase latency. In this section, we describe
how to estimate the quality of access plans and how
to choose the plan with the best expected quality.


Information Gain as a measure of quality 

The first step of crowd access optimization is estimating the
quality of access plans before they are executed. One attempt
might be to quantify the accuracy of individual access paths
in isolation, and choose an objective function that prefers
the selection of more accurate access paths. However, due to
statistical dependencies of responses within an access path
(e.g., correlated errors in the workers' responses), there are
diminishing returns in repeatedly selecting a single access
path. To counter this effect, an alternative would be to define
the quality of an access plan as a measure of diversity
(Hui and Li 2015). For example, we might prefer to equally
distribute the budget across access paths. However, some access
paths may be very uninformative or inaccurate, and optimizing
diversity alone will waste budget. Instead, we use
the joint information gain IG(Y; S) of the task variable Y
in our model and an access plan S as a measurement of plan
quality as well as an objective function for our optimization
scheme. Formally, this is defined as:

$$IG(Y; S) = H(Y) - H(Y \mid S) \tag{5}$$

An access plan S determines how many variables X to
choose from each access path $Z_i$. IG(Y; S) measures the
entropy reduction (as a measure of uncertainty) of the task
variable Y after an access plan S is observed. At the beginning,
selecting from the most accurate access paths provides
the highest uncertainty reduction. However, if better access
paths are exhausted (i.e., accessed relatively often), asking
on less accurate ones reduces the entropy more than continuing
to ask on previously explored paths. This situation
reflects how information gain explores diversity and
increases the prediction confidence if evidence is retrieved
from independent channels. Based on this analysis, information
gain naturally trades off accuracy and diversity. While
plans with high information gain do exhibit diversity, this is
only a means for achieving high predictive performance.

Information gain computation. The computation of the
conditional entropy $H(Y \mid S)$ as part of information gain in
Equation 5 is a difficult problem, as full calculation requires
enumerating all possible instantiations of the plan. Formally,
the conditional entropy can be computed as:

$$H(Y \mid S) = \sum_{y \in Y} \sum_{x \in X_S} p(x, y) \log \frac{p(x)}{p(x, y)} \tag{6}$$

$X_S$ refers to all the possible assignments that votes can take
according to plan S. We choose to follow the sampling approach
presented in (Krause and Guestrin 2005a), which randomly
generates samples satisfying the access plan according
to our Bayesian Network model. The final conditional
entropy will then be the average value of the conditional entropies
of the generated samples. This method is known to
provide absolute error guarantees for any desired level of
confidence if enough samples are generated. Moreover, it
runs in polynomial time if sampling and probabilistic inference
can also be done in polynomial time. Both conditions
are satisfied by our model due to the tree-shaped configuration
of the Bayesian Network. They also hold for the Naive
Bayes baselines as simpler tree versions of the APM.
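
For very small plans, Equations 5 and 6 can be evaluated exactly by brute-force enumeration, which is useful as a sanity check. The Python sketch below uses exact enumeration instead of the sampling approach described above, and it assumes a binary task, binary latent variables, and symmetric error rates (all assumptions of this illustration); the names `sym`, `joint`, and `info_gain` are hypothetical.

```python
import math
from itertools import product


def sym(e):
    """Symmetric 2x2 conditional table: table[a][b] = p(b | a) with error e."""
    return [[1.0 - e, e], [e, 1.0 - e]]


def joint(y, votes_by_path, path_err, worker_err, prior=0.5):
    """Joint p(y, x) with each latent access path variable marginalized out."""
    p = prior if y == 1 else 1.0 - prior
    for i, votes in enumerate(votes_by_path):
        pz, px = sym(path_err[i]), sym(worker_err[i])
        s = 0.0
        for z in (0, 1):
            t = pz[y][z]
            for x in votes:
                t *= px[z][x]
            s += t
        p *= s
    return p


def info_gain(plan, path_err, worker_err, prior=0.5):
    """IG(Y; S) = H(Y) - H(Y | S) in bits, enumerating all vote assignments."""
    h_y = -sum(p * math.log2(p) for p in (prior, 1.0 - prior) if p > 0)
    h_cond = 0.0
    for flat in product((0, 1), repeat=sum(plan)):
        # Split the flat assignment into one vote list per access path.
        votes, k = [], 0
        for s in plan:
            votes.append(list(flat[k:k + s]))
            k += s
        pxy = [joint(y, votes, path_err, worker_err, prior) for y in (0, 1)]
        px = sum(pxy)
        h_cond += sum(p * math.log2(px / p) for p in pxy if p > 0)
    return h_y - h_cond


if __name__ == "__main__":
    path_err, worker_err = [0.1, 0.15], [0.1, 0.1]  # assumed error rates
    for plan in ([1, 0], [2, 0], [1, 1]):
        print(plan, round(info_gain(plan, path_err, worker_err), 4))
```

In this toy configuration, the second vote on the same path adds less information than the first one, and splitting two votes across both paths yields a higher information gain than asking the more accurate path twice. The enumeration is exponential in the number of votes, so it does not replace sampling for realistic plans.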

Submodularity of information gain. Next, we derive the
submodularity property of our objective function based on
information gain in Equation 5. The property will then be
leveraged by the greedy optimization scheme in proving
constant factor approximation bounds. A submodular function
is a function that satisfies the law of diminishing returns,
which means that the marginal gain of the function decreases
while incrementally adding more elements to the input set.

Let V be a finite set. A set function $F : 2^V \to \mathbb{R}$ is submodular
if $F(S \cup \{v\}) - F(S) \ge F(S' \cup \{v\}) - F(S')$
for all $S \subseteq S' \subseteq V$, $v \notin S'$. For our model, this intuitively
means that collecting a new vote from the crowd adds more
information when few votes have been acquired rather than
when many of them have already been collected. While information
gain is non-decreasing and non-negative, it may
not be submodular for a general Bayesian Network. Information
gain can be shown to be submodular for the Naive
Bayes Model for Access Paths (NBAP) in Figure 3 by applying
the results from (Krause and Guestrin 2005a). Here,
we prove its submodularity property for the APM Bayesian
Network shown in Figure 4. Theorem 1 formally states the
result and below we describe a short sketch of the proof.¹

Theorem 1. The objective function based on information
gain in Equation 5 for the Bayesian Network Model for Access
Paths (APM) is submodular.

Sketch of Theorem 1. For proving Theorem 1 we consider a
generic Bayesian Network with N access paths and M possible
worker votes on each access path. To prove the submodularity
of the objective function, we consider two sets
(plans) $S \subset S'$ where $S' = S \cup \{v_j\}$, i.e., $S'$ contains one
additional vote from access path j compared to S. Then, we
consider adding a vote $v_i$ from access path i and we prove
the diminishing return property of adding $v_i$ to $S'$ compared
to adding it to S. The proof considers two cases. When $v_i$ and
$v_j$ belong to different access paths, i.e., $i \ne j$, the proof follows
by using the property of conditional independence of
votes from different access paths given Y and using the “information
never hurts” principle (Cover and Thomas 2012).
For the case of $v_i$ and $v_j$ belonging to the same access path,
we reduce the network to an equivalent network which contains
only one access path $Z_i$ and then use the “data processing
inequality” principle (Cover and Thomas 2012). □


ALGORITHM 1. Greedy Crowd Access Optimization 


This theoretical result is of generic interest for other ap¬ 
plications and a step forward in proving the submodularity 
of information gain for more generic Bayesian networks. 

Optimization scheme 

After having determined the joint information gain as an ap¬ 
propriate quality measure for a plan, the crowd access opti- 
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,6 = 0 


Input: budget B 
Output: best plan Sbest 
Initialization: Sbest =0 
while (3* s.t. b < a) do 
Ubest — 0 
for i = 1 to N do 
S pU re = PurePlan(i) 
if d < B — b then 

A ig = IG(F; Sb es t U Spure 

if > Ubest then 

uLt = A / f; 

Smax = Sbest U Sp 

Sbest ~ Smax 
b — COSt (Sbest') 

return Sbest 


)-I G(Y, Sbest) 


mization problem is to compute: 


S bes t = arg maxIG(y; S) s.t. 
SeS 


N 

E 

i=l 


Cl ■ 5[z] < B (7) 


where $\mathcal{S}$ is the set of all plans. An exhaustive search would
consider $|\mathcal{S}| = \prod_{i=1}^{N} \frac{B}{c_i}$ plans out of which the ones that are
not feasible have to be eliminated. Nevertheless, efficient
approximation schemes can be constructed given that the
problem is an instance of submodular function maximization
under budget constraints (Krause and Guestrin 2005b;
Sviridenko 2004). Based on the submodular and non-decreasing
properties of information gain we devise a greedy technique
in Algorithm 1 that incrementally finds a local approximation
for the best plan. In each step, the algorithm evaluates
the benefit-cost ratio $U$ between the marginal information
gain and cost for all feasible access paths. The marginal
information gain is the improvement of information gain by
adding to the current best plan one pure vote from one
access path. In the worst case, when all access paths have
unit cost, the computational complexity of the algorithm is
$O(GN^2MB)$, where $G$ is the number of generated samples
for computing information gain.
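The greedy step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the objective is passed in as a callable, and `toy_gain` is a hypothetical saturating stand-in for $IG(Y; S)$ (the paper estimates information gain by sampling from the Bayesian network). Unlike Algorithm 1, the sketch starts the best ratio at minus infinity so that it always makes progress, even when the marginal gain is zero.

```python
from typing import Callable, List

Plan = List[int]  # plan[i] = number of votes requested from access path i


def greedy_plan(costs: List[float], budget: float,
                info_gain: Callable[[Plan], float]) -> Plan:
    """Greedy benefit-cost selection in the spirit of Algorithm 1."""
    plan = [0] * len(costs)
    spent = 0.0
    # Continue while at least one access path is still affordable.
    while any(c <= budget - spent for c in costs):
        base = info_gain(plan)
        best_ratio, best_path = float("-inf"), -1
        for i, c in enumerate(costs):
            if c > budget - spent:
                continue  # path i no longer fits in the remaining budget
            candidate = plan.copy()
            candidate[i] += 1  # one additional pure vote from path i
            ratio = (info_gain(candidate) - base) / c
            if ratio > best_ratio:
                best_ratio, best_path = ratio, i
        plan[best_path] += 1
        spent += costs[best_path]
    return plan


def toy_gain(plan: Plan) -> float:
    """Hypothetical stand-in for IG(Y; S): saturating gain per access path."""
    weights = [0.2, 0.25, 0.5]  # assumed informativeness of each path
    return sum(w * (1.0 - 0.5 ** n) for w, n in zip(weights, plan))
```

Calling `greedy_plan([2, 3, 4], 12, toy_gain)` spreads the budget over all three paths, because the gain of each path saturates as more votes are taken from it.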

Theoretical bounds of greedy optimization. We now employ
the submodularity of information gain in our Bayesian
network to prove theoretical bounds of the greedy optimization
scheme. For the simple case of unit cost access paths,
the greedy selection in Algorithm 1 guarantees a utility of
at least $(1 - 1/e)$ ($\approx 0.63$) times the one obtained by
optimal selection denoted by OPT (Nemhauser, Wolsey, and
Fisher 1978). However, the greedy selection scheme fails to
provide approximation guarantees for the general setting of
varying costs (Khuller, Moss, and Naor 1999).

¹Full proofs available at http://arxiv.org/abs/1508.01951

Here, we exploit the following realistic property about the
costs of the access paths and allocated budget to prove strong
theoretical guarantees about our Algorithm 1. We assume
that the allocated budget is large enough compared to the
costs of the access paths. Formally, we assume that
the cost of any access path $c_i$ is bounded away from the total
budget $B$ by a factor $\gamma$, i.e., $c_i \le \gamma \cdot B \ \forall i \in \{1, \ldots, N\}$,
where $\gamma \in (0, 1)$. We state the theoretical guarantees of
Algorithm 1 in Theorem 2 below.¹

Theorem 2. The Greedy optimization in Algorithm 1
achieves a utility of at least $\left(1 - \frac{1}{e^{(1-\gamma)}}\right)$ times that obtained
by the optimal plan OPT, where $\gamma = \max_{i \in \{1, \ldots, N\}} \frac{c_i}{B}$.

For instance, Algorithm 1 achieves an approximation ratio
of at least 0.39 for $\gamma = 0.5$, and 0.59 for $\gamma = 0.10$.
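The two numeric instances can be reproduced directly from the bound in Theorem 2. The helper names below are illustrative and not part of the paper.

```python
import math
from typing import Sequence


def cost_budget_ratio(costs: Sequence[float], budget: float) -> float:
    """gamma = max_i c_i / B, the largest cost-to-budget ratio."""
    return max(costs) / budget


def greedy_bound(gamma: float) -> float:
    """Approximation ratio 1 - 1 / e^(1 - gamma) from Theorem 2."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    return 1.0 - math.exp(-(1.0 - gamma))
```

As $\gamma$ approaches 0 the bound approaches the unit-cost guarantee $1 - 1/e$, which matches the intuition that a large budget relative to the path costs makes the varying-cost case behave like the unit-cost case.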

Sketch of Theorem 2. We follow the structure of the proof
from (Khuller, Moss, and Naor 1999; Sviridenko 2004). The
key idea is to use the fact that the budget spent by the
algorithm at the end of execution, when it cannot add an
element to the solution, is at least $(B - \max_{i \in [1, \ldots, N]} c_i)$, which
is lower-bounded by $B(1 - \gamma)$. This lower bound on the
spent budget, along with the fact that the elements are picked
greedily at every iteration, leads to the desired bounds. □

These results are of practical importance in many other 
applications as the assumption of non-unit but bounded costs 
with respect to budget often holds in realistic settings. 

Experimental Evaluation 

We evaluated our work on three real-world datasets. The 
main goal of the experiments is to validate the proposed 
model and the optimization technique. We compare our 
approach with other state-of-the-art alternatives and results
show that leveraging diversity through the Access Path
Model combined with the greedy crowd access optimization 
technique can indeed improve the quality of predictions. 
Metrics. The comparison is based on two main metrics: ac¬ 
curacy and negative log-likelihood. Accuracy corresponds 
to the percentage of correct predictions. Negative log- 
likelihood is computed as the sum over all test samples of 
the negative log-likelihood that the prediction is accurate. 
Hence, it measures not only the correctness of a model but 
also its ability to output meaningful confidence. 

$$-\text{logLikelihood} = -\sum_{s_t} \log p(\text{prediction} = y_t \mid x_t) \tag{8}$$

The closer a prediction is to the real outcome the lower is 
its negative log-likelihood. Thus, a desirable model should 
offer low values of negative log-likelihood. 
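Both metrics are straightforward to compute from predicted probabilities. The sketch below assumes a binary task where the model outputs the probability of the positive label; the function names and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def accuracy(prob_positive: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of samples whose most probable label matches the truth."""
    hits = sum(int(p >= 0.5) == y for p, y in zip(prob_positive, labels))
    return hits / len(labels)


def negative_log_likelihood(prob_positive: Sequence[float],
                            labels: Sequence[int],
                            eps: float = 1e-12) -> float:
    """Sum over test samples of -log p(true label), as in Equation 8."""
    total = 0.0
    for p, y in zip(prob_positive, labels):
        p_true = p if y == 1 else 1.0 - p
        total -= math.log(max(p_true, eps))  # clip to avoid log(0)
    return total
```

A confident wrong prediction (for example probability 0.99 on the wrong label) contributes far more to the negative log-likelihood than a hesitant one, which is why the metric exposes overconfident models that accuracy alone would not.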

Dataset description 

All the following datasets come from real crowdsourcing 
tasks. For experiments with restricted budget, we repeat the 
learning and prediction process via random vote selection 
and k-fold cross-validation. 

CUB-200. The dataset (Welinder et al. 2010) was built as a
large-scale data collection for attribute-based classification 
of bird images on Amazon Mechanical Turk (AMT). Since 
this is a difficult task even for experts, the crowd workers are 
not directly asked to determine the bird category but whether 
a certain attribute is present in the image. Each attribute (e.g., 
yellow beak) brings a piece of information for the problem 
and we treat them as access paths. The dataset contains 5-10 
answers for each of the 288 available attributes. We keep the 


cost of all access paths equal as there was no clear evidence 
of attributes that are more difficult to distinguish than others. 
The total number of answers is approximately $7.5 \times 10^6$.

MedicalQA. We gathered 100 medical questions and forwarded
them to AMT. Workers were asked to answer the
questions after reading in specific health forums, categorized
as in Table 4, which we then design as access paths. 255 people
participated in our experiment. The origin of the answer
was checked via an explanation URL provided along with the
answer as a sanity check. The tasks were paid equally to prevent
the price of the task from affecting the quality of the answers.
For experimental purposes, we assign an integer cost of (3,
2, 1) based on the reasoning that in real life doctors are more
expensive to ask, followed by patients and common people.



      Description             Forums
(1)   Answers from doctors    www.webmd.com
                              www.medhelp.org
(2)   Answers from patients   www.patient.co.uk
                              www.ehealthforum.com
(3)   General Q&A forum       www.quora.com
                              www.wiki.answers.com

Table 4: Access Path Design for MedicalQA dataset.


ProbabilitySports. This data is based on a crowdsourced 
betting competition (www.probabilitysports.com) on NFL 
games. The participants voted on the question: “Is the home 
team going to win?” for 250 events within a season. There 
are 5,930 players in the entire dataset contributing with 
1,413,534 bets. We designed the access paths based on the 
accuracy of each player in the training set which does not 
reveal information about the testing set. Since the players’ 
accuracy in the dataset follows a normal distribution, we di¬ 
vide this distribution into three intervals where each interval 
corresponds to one access path (worse than average,
average, better than average). As access paths have a decreasing
error rate, we assign them an increasing cost (2, 3, 4).
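One way to realize this access path design is sketched below. The text does not state where the three intervals are cut, so the thresholds at half a standard deviation around the mean (`width = 0.5`) are purely an assumption for illustration, as are the function and constant names; only the costs (2, 3, 4) come from the description above.

```python
import statistics
from typing import List, Sequence

# Costs from the text, ordered: worse than average, average, better than average.
PATH_COSTS = (2, 3, 4)


def assign_access_paths(train_accuracy: Sequence[float],
                        width: float = 0.5) -> List[int]:
    """Map each player's training accuracy to an access path index 0, 1 or 2.

    The interval boundaries (mean +/- width * std) are an assumed choice.
    """
    mean = statistics.fmean(train_accuracy)
    std = statistics.pstdev(train_accuracy)
    low, high = mean - width * std, mean + width * std
    paths = []
    for acc in train_accuracy:
        if acc < low:
            paths.append(0)  # worse than average
        elif acc > high:
            paths.append(2)  # better than average
        else:
            paths.append(1)  # average
    return paths
```

Because only training accuracy is used, the assignment does not leak information about the test events.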

Model evaluation 

For evaluating the Access Path Model independently of the 
optimization, we first show experiments where the budget 
is equally distributed across access paths. The question we 
want to answer here is: “How robust are the APM predic¬ 
tions in terms of accuracy and negative log-likelihood?” 

Experiment 1: Constrained budget. Figure 5 illustrates the
effect of data sparsity on quality. We varied the budget and 
equally distributed it across all access paths. We do not show 
results from CUB-200 as the maximum number of votes per 
access path in this dataset is 5-10. 

MedicalQA. The participation of workers in this experiment 
was stable, which allows for a better error estimation. Thus, 
as shown in Figure 5(a), for high redundancy NBI reaches
comparable accuracy with the APM although the negative 
log-likelihood dramatically increases. For lower budget and 
high sparsity NBI cannot provide accurate results. 

ProbabilitySports. Figure 5(b) shows that while the improvement
of the APM accuracy over NBI and MV is stable,
NBAP starts facing the overconfidence problem as the budget
increases. NBI exhibits low accuracy due to very high
sparsity even for sufficient budget. Majority Vote fails to
produce accurate predictions as it is agnostic to error rates.

(a) MedicalQA (b) ProbabilitySports (year = 2002)

Figure 5: Accuracy and negative log-likelihood for equally distributed budget across all access paths. The negative log-likelihood of Naive
Bayes models deteriorates for high budget while for the APM it stays stable. NBI is not competitive due to data sparsity.

Optimization scheme evaluation 

In these experiments, we evaluate the efficiency of the 
greedy approximation scheme to choose high-quality plans. 
For a fair comparison, we adapted the same scheme to NBI 
and NBAP. We will use the following acronyms for the
crowd access strategies: OPT (optimal selection), GREEDY 
(greedy approximation), RND (random selection), BEST 
(votes from the most accurate access path), and EQUAL 
(equal distribution of votes across access paths). 

Experiment 2: Greedy approximation and diversity. The
goal of this experiment is to answer the questions: “How
close is the greedy approximation to the theoretical optimal
solution?” and “How does information gain exploit
diversity?”. Figure 6 shows the development of information gain
for the optimal plan, the greedily approximated plan, the 
equal distribution plan, and three pure plans that take votes 
only from one access path. The quality of GREEDY is very 
close to the optimal plan. The third access path in Probabil¬ 
itySports (containing better than average users) reaches the 
highest information gain compared to the others. Neverthe¬ 
less, its quality is saturated for higher budget which encour¬ 
ages the optimization scheme to select other access paths as 
well. Also, we notice that the EQUAL plan does not reach 
optimal values of information gain although it maximizes 
diversity. Next, we show that the quality of predictions can 
be further improved if diversity is instead planned by using 
information gain as an objective. 
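To make the link between information gain and diversity concrete, the following sketch computes $IG(Y; S)$ exactly for a toy Access Path Model. The parameterization is an assumption made for illustration only: $Y$ is binary and uniform, each latent path variable $Z_i$ agrees with $Y$ with probability `path_acc[i]`, and each vote agrees with its $Z_i$ with probability `vote_acc[i]`. The paper estimates information gain by sampling; exact enumeration is used here because the toy network is tiny.

```python
import math
from itertools import product
from typing import Sequence


def count_likelihood(n: int, k: int, y: int,
                     path_acc: float, vote_acc: float) -> float:
    """P(k positive votes out of n | Y = y) for a single access path."""
    total = 0.0
    for z in (0, 1):
        p_z = path_acc if z == y else 1.0 - path_acc  # P(Z = z | Y = y)
        p_one = vote_acc if z == 1 else 1.0 - vote_acc  # P(vote = 1 | Z = z)
        total += p_z * math.comb(n, k) * p_one ** k * (1.0 - p_one) ** (n - k)
    return total


def information_gain(plan: Sequence[int], path_acc: Sequence[float],
                     vote_acc: Sequence[float]) -> float:
    """IG(Y; S) in bits, where plan[i] is the number of votes from path i."""
    h_y_given_s = 0.0
    # Votes within a path are exchangeable, so per-path counts are sufficient.
    for counts in product(*(range(n + 1) for n in plan)):
        joint = [
            0.5 * math.prod(
                count_likelihood(n, k, y, a, t)
                for n, k, a, t in zip(plan, counts, path_acc, vote_acc)
            )
            for y in (0, 1)
        ]
        p_s = sum(joint)
        for p in joint:
            if p > 0.0:
                h_y_given_s -= p * math.log2(p / p_s)
    return 1.0 - h_y_given_s  # H(Y) = 1 bit for a uniform binary Y
```

With two identical paths, for example `path_acc = [0.8, 0.8]` and `vote_acc = [0.9, 0.9]`, the plan `[1, 1]` yields a higher information gain than `[2, 0]`: the second vote from the same path is partly redundant because both votes share the same latent $Z_i$.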

Experiment 3: Crowd access optimization. This experi¬ 
ment combines together both the model and the optimiza¬ 
tion technique. The main question we want to answer here 
is: “What is the practical benefit of greedy optimization on 
the APM w.r.t. accuracy and negative log-likelihood?” 

CUB-200. For this dataset (Figure 7(a)) where the access
path design is based on attributes, the discrepancy between 
NBAP and the APM is high and EQUAL plans exhibit low 
quality as not all attributes are informative for all tasks. 

ProbabilitySports. Access Path based models (APM and
NBAP) outperform MV and NBI. NBI plans target concrete
users in the competition. Hence, their accuracy for budget
values less than 10 is low as not all targeted users voted for
all events. Since access paths are designed based on the
accuracy of workers, EQUAL plans do not offer a clear
improvement while NBAP is advantaged in terms of accuracy
by its preference to select the most accurate access paths.

[Figure 6 legend: OPT, GREEDY, AP1 (cost = 2), AP2 (cost = 3), AP3 (BEST, cost = 4), EQUAL; x-axis: Budget]

Figure 6: Information gain and budget distribution for ProbabilitySports
(year=2002). As budget increases, GREEDY access plans
exploit more than one access path.

Experiment 5: Diversity impact. This experiment is designed
to study the impact of diversity and conditional dependence
on crowd access optimization, and finally answer
the question: “How does greedy optimization on the APM
handle diversity?”. One form of such dependency is
within-access-path correlation. If this correlation holds, workers
agree on the same answer. We experimented by varying the
shared dependency within the access path as follows: Given
a certain probability $p$, we decide whether a vote should follow
the majority vote of existing answers in the same access
path. For example, for $p = 0.4$, 40% of the votes will follow
the majority vote decision of the previous workers and the
other 60% will be drawn from the real crowd votes.
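The injection of dependency described above can be sketched as follows for binary votes within one access path. The function name, the handling of the first vote (always drawn from the real votes) and the tie-breaking rule (ties count as a majority for 1) are assumptions; the text does not specify them.

```python
import random
from typing import List, Sequence


def simulate_dependent_votes(real_votes: Sequence[int], n_votes: int,
                             p: float, rng: random.Random) -> List[int]:
    """Generate n_votes binary votes for one access path.

    With probability p a vote copies the majority of the votes generated so
    far in this path; otherwise it is drawn from the real crowd votes.
    """
    pool = list(real_votes)
    votes: List[int] = []
    for _ in range(n_votes):
        if votes and rng.random() < p:
            ones = sum(votes)
            majority = 1 if 2 * ones >= len(votes) else 0  # assumed tie rule
            votes.append(majority)
        else:
            votes.append(rng.choice(pool))
    return votes
```

Setting `p = 0` reproduces plain sampling from the real votes, while `p = 1` makes every vote repeat the first one, the extreme case of within-path correlation.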

Figure 8(a) shows that the overall quality drops when dependency
is high but the Access Path Model is more robust
to it. NBAP instead, due to overconfidence, accumulates all
votes into a single access path which dramatically penalizes
its quality. APM+BEST applies the APM to votes selected
from the access path with the best accuracy, in our case doctors’
answers. Results show that for $p > 0.2$, it is preferable
to not only select from the best access path but to distribute
the budget according to the GREEDY scheme. Figure 8(b)
shows results from the same experiment for $p = 0.4$
and varying budget. APM+GREEDY outperforms all other
methods, reaching a stable quality at $B = 30$, which motivates
the need to design techniques that can stop the crowdsourcing
process if no new insights are possible.

Figure 7: Crowd access optimization results for varying budget. Data sparsity and non-guaranteed votes are better handled by the APM model
also for optimization purposes, leading to improved accuracy and confidence.

(a) MedicalQA, B = 25 (b) MedicalQA, p = 0.4

[Figure 8 legend: APM + GREEDY, APM + BEST, APM + EQUAL]

Figure 8: Diversity and dependence impact on optimization. As the common dependency of workers within access paths increases, investing
the whole budget on the best access path or randomly is not efficient.

Discussion 

We presented experiments based on three different and challenging
crowdsourced datasets. However, our approach and
our results are general-purpose and are not tailored to any
of the datasets. The main findings are:

• In real-world crowdsourcing the unrealistic assumption of 
pairwise worker independence poses limitations to qual¬ 
ity assurance and increases the cost of crowdsourced solu¬ 
tions based on individual and majority vote models. 

• Managing and exploiting diversity with the APM ensures 
quality in terms of accuracy and more significantly neg¬ 
ative log-likelihood. Crowd access optimization schemes 
on top of this perspective are practical and cost-efficient. 

• Surprisingly, access plans that combine various access 
paths make better predictions than plans which spend the 
whole budget in a single access path. 

Related Work 


The reliability of crowdsourcing and relevant optimization
techniques are longstanding issues for human computation
platforms. The following directions are closest to our study:

Quality assurance and control. One of the central works
in this field is presented by (Dawid and Skene 1979). In an
experimental design with noisy observers, the authors use an
Expectation Maximization algorithm (Dempster, Laird, and
Rubin 1977) to obtain maximum likelihood estimates for
the observer variation when ground truth is missing or partially
available. This has served as a foundation for several
following contributions (Ipeirotis, Provost, and Wang 2010;
Raykar et al. 2010; Whitehill et al. 2009; Zhou et al. 2012),
placing Dawid and Skene’s algorithm in a crowdsourcing
context and enriching it for building performance-sensitive
pricing schemes. The APM model enhances these quality
definitions by leveraging the fact that the error rates of workers
are directly affected by the access path that they follow,
which allows for efficient optimization.

Query and crowd access optimization. In crowdsourced
databases, quality assurance and crowd access optimization
are envisioned as part of the query optimizer, which needs to
estimate the query plans not only according to the cost but
also to their accuracy and latency. Previous work (Franklin et
al. 2011; Marcus et al. 2011; Parameswaran et al. 2012)
focuses on building declarative query languages with support
for processing crowdsourced data. The proposed optimizers
define the execution order of operators in query plans and
map crowdsourcable operators to micro-tasks. In our work,
we propose a complementary approach by ensuring the quality
of each single operator executed by the crowd.

Crowd access optimization is similar to the expert selection
problem in decision-making. However, the assumption
that the selected individuals will answer may no longer hold.
Previous studies based on this assumption are (Karger, Oh,
and Shah 2011; Ho, Jabbari, and Vaughan 2013; Jung and
Lease 2013). The proposed methods are nevertheless effective
for task recommendation and performance evaluation.

Diversity for quality. Relevant studies in management science
(Hong and Page 2004; Lamberson and Page 2012)
emphasize diversity and define the notion of types to refer
to highly correlated forecasters. Another work that targets
groups of workers is introduced by (Li, Zhao, and Fuxman
2014). This technique discards groups that do not prove to be
the best ones. (Venanzi et al. 2014) instead, refers to groups
as communities and all of them are used for aggregation but
not for optimization. Other systems like CrowdSearcher by
(Brambilla et al. 2014) and CrowdSTAR by (Nushi et al.
2015) support cross-community task allocation.


Conclusion 

We introduced the Access Path Model, a novel crowd model 
that captures and exploits diversity as an inherent prop¬ 
erty of large-scale crowdsourcing. This model lends itself 
to efficient greedy crowd access optimization. The result¬ 
ing plan has strong theoretical guarantees, since, as we 
prove, the information gain objective is submodular in our 
model. The presented theoretical results are of general in¬ 
terest and applicable to a wide range of variable selection 
and experimental design problems. We evaluated our ap¬ 
proach on three real-world crowdsourcing datasets. Exper¬ 
iments demonstrate that our approach can be used to seam¬ 
lessly handle critical problems in crowdsourcing such as 
quality assurance and crowd access optimization even in situations
of anonymized and sparse data.

Proof of Theorem 1

In order to prove Theorem 1, we will consider a generic Bayesian
Network for the Access Path Model (APM) with $N$ access paths
and each access path associated with $M$ possible votes from workers.
Hence, we have the following set of random variables to represent
this network:

i) $Y$ is the random variable of the crowdsourcing task.

ii) $Z : \{Z_1, \ldots, Z_i, \ldots, Z_N\}$ are the latent random variables of
the $N$ access paths.

iii) $X : \{X_{ij} \text{ for } i \in [1, \ldots, N] \text{ and } j \in [1, \ldots, M]\}$ represents a
set of random variables associated with all the workers from the
access paths.

The goal is to prove the submodularity property of the set function:

$$f(S) = IG(S; Y) \tag{9}$$

i.e., the information gain of $Y$ and $S \subseteq X$ w.r.t. the set selection $S$,
earlier referred to as access plan. We begin by proving the following
Lemma 1 that establishes the submodularity of the information
gain in a network with one access path (i.e., $N = 1$), denoted as
$Z_1$.

Lemma 1. The set function $f(S) = IG(S; Y)$ in Equation 9 is
submodular for the Bayesian Network representing an Access Path
Model with $N = 1$ access path denoted by $Z_1$, associated with $M$
workers denoted by $X : \{X_{1j} \text{ for } j \in [1, \ldots, M]\}$.

Proof of Lemma 1. Figure 9 illustrates the Bayesian Network considered
here with one access path $Z_1$. For the sake of the proof,
we consider an alternate view of the same network as shown in
Figure 10. Here, the auxiliary variable $Z_{1j}$ denotes the set of the
first $j$ variables associated with workers’ votes from access path
$Z_1$, i.e., $Z_{1j} = \{X_{11}, X_{12}, \ldots, X_{1j}\}$. This alternate view is
taken from the following generative process: $Z_1$ is first sampled
given $Y$, followed by sampling of $Z_{1M}$ from $Z_1$, where
$Z_{1M} = \{X_{11}, X_{12}, \ldots, X_{1M}\}$. Given $Z_{1M}$, the remaining
$Z_{1j} \ \forall j < M$ are just subsets of $Z_{1M}$. We define the set
$Q : \{Z_{1j} \text{ for } j \in [1, \ldots, M]\}$.

One crucial property we use while considering this generative
process here is that all the $X_{1j}$ are just repeated observations of the
same variable associated with the response of a worker from the $Z_1$
access path and hence they are anonymous and ordering does not
matter. Note that querying $j$ workers from $Z_1$, i.e., observing
$S = \{X_{11}, \ldots, X_{1j}\}$, is equivalent to observing $Z_{1j}$. Given this
equivalence of the two representations of Figure 9 and Figure 10, we now
prove the submodularity of the set function $g(A) = IG(A; Y)$, i.e.,
the information gain of $Y$ and $A \subseteq Q$ w.r.t. the set selection $A$.

Note that since $Z_{1j} \subseteq Z_{1j'} \ \forall j < j'$, we can alternatively
write down $A$ as equivalent to the singleton set given by
$\{Z_{1k}\}$ where $k = \arg\max_j Z_{1j} \in A$. Also note that the functions
$f(S)$ and $g(A)$ have a one-to-one equivalence given by
$g(A) = f(\{X_{11}, \ldots, X_{1k}\})$ where $k = \arg\max_j Z_{1j} \in A$.

Figure 10: APM Model for $N = 1$ access path, associated with $M$
workers represented with auxiliary variables $Z_{1j}$

To prove submodularity of $g$, consider sets $A \subset A' \subseteq Q$ and an
element $q \in Q \setminus A'$. Let $A = \{Z_{1j}\}$, $A' = \{Z_{1j'}\}$ where $j' > j$
and $q = Z_{1l}$ where $l > j'$. First, let us consider the marginal utility of
$q$ over $A$ denoted as $\Delta_g(q|A)$, given by:

$$\begin{aligned}
\Delta_g(q|A) &= g(A \cup \{q\}) - g(A) \\
&= IG(A \cup \{q\}; Y) - IG(A; Y) \\
&= IG(\{Z_{1j}\} \cup \{Z_{1l}\}; Y) - IG(\{Z_{1j}\}; Y) \\
&= IG(\{Z_{1l}\}; Y) - IG(\{Z_{1j}\}; Y) && (10) \\
&= IG(Z_{1l}; Y) - IG(Z_{1j}; Y) && (11) \\
&= (H(Y) - H(Y|Z_{1l})) - (H(Y) - H(Y|Z_{1j})) \\
&= H(Y|Z_{1j}) - H(Y|Z_{1l})
\end{aligned}$$

Step 10 uses the fact that $\{Z_{1j}\} \cup \{Z_{1l}\}$ is simply equivalent to
$\{Z_{1l}\}$ as $Z_{1j} \subseteq Z_{1l}$. Step 11 replaces the singleton sets $\{Z_{1l}\}$ and
$\{Z_{1j}\}$ by the associated random variables $Z_{1l}$ and $Z_{1j}$. Now, to
prove submodularity, we need to show that $\Delta_g(q|A) \ge \Delta_g(q|A')$,
given by:

$$\begin{aligned}
\Delta_g(q|A) - \Delta_g(q|A') &= (H(Y|Z_{1j}) - H(Y|Z_{1l})) - (H(Y|Z_{1j'}) - H(Y|Z_{1l})) \\
&= H(Y|Z_{1j}) - H(Y|Z_{1j'}) \\
&= (H(Y) - H(Y|Z_{1j'})) - (H(Y) - H(Y|Z_{1j})) \\
&= IG(Z_{1j'}; Y) - IG(Z_{1j}; Y) \\
&\ge 0 && (12)
\end{aligned}$$

Step 12 uses the “data processing inequality” (Cover and
Thomas 2012), which states that post-processing cannot increase
information, or the mutual information gain between two random
variables decreases with addition of more intermediate random
variables in the unidirectional network considered in Figure 10. □

Next, we use the result of Lemma 1 to prove the results for
generic networks with N access paths. 

Proof of Theorem 1. We now consider a generic Bayesian Network
for the Access Path Model (APM) with $N$ access paths and
each access path associated with $M$ possible votes from workers.
Again taking the alternate view as illustrated in Figure 10, we
define auxiliary variables $Z_{ij}$ denoting the set of the first $j$ variables
associated with workers’ votes from access path $Z_i$, i.e.,
$Z_{ij} = \{X_{i1}, X_{i2}, \ldots, X_{ij}\}$. As before, we define the set
$Q : \{Z_{ij} \text{ for } i \in [1, \ldots, N] \text{ and } j \in [1, \ldots, M]\}$.
The goal is to prove the submodularity
of the set function $g(A) = IG(A; Y)$, i.e., the information
gain of $Y$ and $A \subseteq Q$ w.r.t. the set selection $A$.

We define $Q_i : \{Z_{ij} \text{ for } j \in [1, \ldots, M]\} \ \forall i \in [1, \ldots, N]$, and
hence we can write $Q = \bigcup_{i=1}^{N} Q_i$. We can similarly write
$A = \bigcup_{i=1}^{N} A_i$ where $A_i = A \cap Q_i$. We denote the complements of $A_i$ and
$Q_i$ as $A_i^c$ and $Q_i^c$ respectively, defined as follows: $Q_i^c = Q \setminus Q_i$
and $A_i^c = A \cap Q_i^c$.

To prove the submodularity property of $g$, consider two sets $A \subseteq Q$
and $A' = A \cup \{s\}$, as well as an element $q \in Q \setminus A'$. Let $q \in Q_i$.
We consider the following two cases:

Case i). $s \in Q_i$ ($q$ and $s$ belong to the same access path.)

Note that we can write $A = A_i \cup A_i^c$ and $A' = A_i' \cup A_i^c$, as $A$ and
$A'$ differ only along access path $i$. Also, let us denote a particular
realization of the variables in set $A_i^c$ by $a_i^c$. The key idea that we use
is that for a given realization of $A_i^c$, the generic Bayesian Network
with $N$ access paths can be factorized in a similar way as with just
one access path (Figure 10), when computing the marginal gains of
$q$ over $A_i$ and $A_i \cup \{s\}$.

Again, we need to show $\Delta_g(q|A) \ge \Delta_g(q|A')$, given by:

$$\begin{aligned}
\Delta_g(q|A) - \Delta_g(q|A') &= \Delta_g(q|A_i \cup A_i^c) - \Delta_g(q|A_i' \cup A_i^c) \\
&= \mathbb{E}_{a_i^c}\left(\Delta_g(q|A_i, a_i^c) - \Delta_g(q|A_i', a_i^c)\right) && (13) \\
&\ge 0 && (14)
\end{aligned}$$

Step 13 considers the expectation over all the possible realizations
of random variables in $A_i^c$. Step 14 uses the result of Lemma 1 as
this network for a given realization of $A_i^c$ has the same characteristics
as a single access path network where information gain is submodular.
Hence, each term inside the expectation is non-negative,
proving therefore the desired result.

Next, we consider the other case when $q$ and $s$ belong to different
access paths.

Case ii). $s \in Q_i^c$ ($q$ and $s$ belong to different access paths.) First,
let us consider the marginal utility of $q$ over $A$ denoted as $\Delta_g(q|A)$,
given by:

$$\begin{aligned}
\Delta_g(q|A) &= g(A \cup \{q\}) - g(A) \\
&= IG(A \cup \{q\}; Y) - IG(A; Y) \\
&= (H(A \cup \{q\}) - H(A \cup \{q\}|Y)) - (H(A) - H(A|Y)) \\
&= (H(A \cup \{q\}) - H(A)) - (H(A \cup \{q\}|Y) - H(A|Y)) \\
&= H(q|A) - H(q|A, Y) && (15) \\
&= H(q|A) - H(q|A_i, Y) && (16)
\end{aligned}$$

Step 15 simply replaces the singleton set $\{q\}$ with the random
variable $q$. Step 16 uses the fact that $A = A_i \cup A_i^c$ and the conditional
independence of $q$ and $A_i^c$ given $Y$.

Now, to prove submodularity, we need to show $\Delta_g(q|A) \ge \Delta_g(q|A')$, given by:

$$\begin{aligned}
\Delta_g(q|A) - \Delta_g(q|A') &= (H(q|A) - H(q|A_i, Y)) - (H(q|A') - H(q|A_i, Y)) && (17) \\
&= H(q|A) - H(q|A') \\
&\ge 0 && (18)
\end{aligned}$$

Step 17 uses the conditional independence of $q$ and $A_i^c$ given $Y$.
Note that a crucial property used in this step is that $s \in {A'}_i^{c}$ for
this case. Step 18 follows from the “information never hurts” principle
(Cover and Thomas 2012), thus proving the desired result and
completing the proof. □


ALGORITHM 2. Greedy for general submodular function

Input: budget $B$, set $V$, function $f$
Output: set $S^{Greedy}$
Initialization: set $S = \emptyset$, iterations $r = 0$, size $l = 0$
while $V \neq \emptyset$ do
    $v^* = \arg\max_{v \in V} \left( \frac{f(S \cup \{v\}) - f(S)}{c_v} \right)$
    if $c(S) + c_{v^*} \le B$ then
        $S = S \cup \{v^*\}$
        $l = l + 1$
    $V = V \setminus \{v^*\}$
    $r = r + 1$
$S^{Greedy} = S$
return $S^{Greedy}$


Proof of Theorem 2

Proof of Theorem 2. In order to prove Theorem 2, we first consider
a general submodular set function and prove the approximation
guarantees for the greedy selection scheme under the assumption
that the cost to budget ratio is bounded by $\gamma$.

Let $V$ be a collection of sets and consider a monotone, non-negative,
submodular set function $f$ defined over $V$ as $f : 2^V \to \mathbb{R}$.
Each element $v \in V$ is associated with a non-negative cost $c_v$.
The budgeted optimization problem can be cast as:

$$S^* = \arg\max_{S \subseteq V} f(S) \quad \text{subject to} \quad \sum_{s \in S} c_s \le B$$


Let $S^{OPT}$ be the optimal solution set for this maximization problem,
which is intractable to compute (Feige 1998). Consider the
generic GREEDY selection algorithm given by Algorithm 2 and
let $S^{Greedy}$ be the set returned by this algorithm. We now analyze
the performance of GREEDY and start by closely following
the proof structure of (Khuller, Moss, and Naor 1999; Sviridenko 2004).
Note that every iteration of Algorithm 2 can be
classified along two dimensions: i) whether a selected element $v^*$
belongs to $S^{OPT}$ or not, and ii) whether $v^*$ gets added to set $S$
or not. First, let us consider the case when $v^*$ belongs to $S^{OPT}$,
however was not added to $S$ because of violation of the budget constraint.
Let $r$ be the total iterations of the algorithm so far, and $l$
be the size of $S$ at this iteration. We can renumber the elements
of $V$ so that $v_i$ is the $i$-th element added to $S$ for $i \in \{1, \ldots, l\}$
and $v_{l+1}$ is the first element from $S^{OPT}$ selected by the algorithm
that could not be added to $S$. Let $S_i$ be the set obtained
when the first $i$ elements have been added to $S$. Also, let $c(S)$ denote
$\sum_{s \in S} c_s$. By using the result of (Khuller, Moss, and Naor 1999;
Sviridenko 2004), the following holds:

$$f(S_i) - f(S_{i-1}) \ge \frac{c_i}{B} \cdot \left(f(S^{OPT}) - f(S_{i-1})\right)$$

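Using the above result, (Khuller, Moss, and Naor 1999; Sviridenko 2004)
shows the following through induction:

$$\begin{aligned}
f(S_l) &\ge \left(1 - \prod_{j=1}^{l}\left(1 - \frac{c_j}{B}\right)\right) \cdot f(S^{OPT}) \\
&\ge \left(1 - \left(1 - \sum_{j=1}^{l} \frac{c_j}{B \cdot l}\right)^{l}\right) \cdot f(S^{OPT}) && (19) \\
&= \left(1 - \left(1 - \frac{c(S_l)}{B \cdot l}\right)^{l}\right) \cdot f(S^{OPT}) && (20)
\end{aligned}$$

In Step 19 we use the property that every function of the form
$\left(1 - \prod_{j=1}^{l}\left(1 - \frac{c_j}{B}\right)\right)$ achieves its minimum at $\left(1 - (1 - \beta)^{l}\right)$ for
$\beta = \sum_{j=1}^{l} \frac{c_j}{B \cdot l}$.
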

Now, we will incorporate our assumption of bounded costs, i.e.,
$c_v \le \gamma \cdot B \ \forall v \in V$, where $\gamma \in (0, 1)$, to get the desired results. We
use the fact that the budget spent by Algorithm 2 at iteration $r$ when it
could not add an element to the solution is at least $(B - \max_{v \in V} c_v)$,
which is lower-bounded by $B(1 - \gamma)$. Hence, the cost of the greedy
solution set $c(S_l)$ at this iteration is at least $B(1 - \gamma)$. Incorporating
this in Step 20, we get:

$$\begin{aligned}
f(S_l) &\ge \left(1 - \left(1 - \frac{(1 - \gamma)}{l}\right)^{l}\right) \cdot f(S^{OPT}) \\
&\ge \left(1 - \frac{1}{e^{(1-\gamma)}}\right) \cdot f(S^{OPT}) && (21)
\end{aligned}$$

This proves that GREEDY in Algorithm 2 achieves a utility of at
least $\left(1 - \frac{1}{e^{(1-\gamma)}}\right)$ times that obtained by the optimal solution OPT.
Given these results, Theorem 2 follows directly given the submodularity
properties of the considered optimization function. □
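The guarantee can be checked empirically on a small instance. The sketch below is an illustration under stated assumptions, not the authors' code: it implements the generic greedy loop of Algorithm 2 with a set-coverage objective (a standard monotone submodular function chosen here only as an example) and compares it against a brute-force optimum. The item names, coverage sets and costs are made up.

```python
import math
from itertools import combinations
from typing import Callable, Dict, Set


def greedy_budgeted(cost: Dict[str, float], budget: float,
                    f: Callable[[Set[str]], float]) -> Set[str]:
    """Generic benefit-cost greedy: every element is considered exactly once."""
    selected: Set[str] = set()
    remaining = set(cost)
    spent = 0.0
    while remaining:
        base = f(selected)
        # Sorting the candidates makes tie-breaking deterministic.
        v = max(sorted(remaining),
                key=lambda u: (f(selected | {u}) - base) / cost[u])
        if spent + cost[v] <= budget:
            selected.add(v)
            spent += cost[v]
        remaining.discard(v)  # drop v whether or not it fit in the budget
    return selected


def brute_force_optimum(cost: Dict[str, float], budget: float,
                        f: Callable[[Set[str]], float]) -> float:
    """Best objective value over all feasible subsets (tiny instances only)."""
    items = sorted(cost)
    best = f(set())
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            if sum(cost[v] for v in subset) <= budget:
                best = max(best, f(set(subset)))
    return best


# Hypothetical instance: each item covers a few ground elements.
COVERS = {"a": {1, 2, 3, 4}, "b": {1, 2}, "c": {3, 4, 5}, "d": {6}}
COST = {"a": 19, "b": 10, "c": 20, "d": 10}
BUDGET = 40


def coverage(selected: Set[str]) -> float:
    """Number of ground elements covered by the selected items."""
    covered: Set[int] = set()
    for v in selected:
        covered |= COVERS[v]
    return float(len(covered))
```

On this instance $\gamma = 20/40 = 0.5$. Greedy first picks the item with the best ratio, which later blocks a more valuable combination, so it ends below the optimum but well above the worst-case ratio $1 - 1/e^{0.5} \approx 0.39$.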







